
Bioinformatics Tools for Studying a Single Gene (A Comprehensive List)

2016-05-27 Xiaoya (小丫), 嘉因生物

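Why stress that these are bioinformatics tools for "studying a single gene"?

Because the tools for studying a single gene are very different from the tools for genome-level analysis.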


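For example: when cloning a gene, you design primers and add restriction sites.

Should you ask someone formally trained in bioinformatics to do it? He can search for restriction sites across a whole genome, but ask him to add suitable restriction sites to a given sequence so that they match the multiple cloning site of the vector, and he may not even understand the request.

Anyone who studies single genes knows that Primer Premier gets this done.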


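Another example: you need to analyze the phosphorylation sites of a protein and want to know which bioinformatics tool can predict them. You ask Xiaoha (小哈), and he is stumped too, because this is simply not his specialty.

Xiaoha studies regulation at the transcriptional level. His strength is combining transcriptome RNA-seq data with epigenomic data such as ChIP-seq and DNase-seq to analyze the transcriptional regulation of genes.

If Xiaoha wants to predict phosphorylation sites on a protein, he also has to search online for tools, find several and compare their predictions, read the literature, and search forums to see how others rate these tools. None of these steps can be skipped.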



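Here is a recommended website, a yellow pages of bioinformatics analysis tools.

http://molbiol-tools.ca

It is Canadian-made and suits most molecular biology laboratories; it is equally helpful for people working on omics.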



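Take DNA MOTIFS as an example. Xiaoya has previously summarized two ways to look up transcription factor binding sites: one based on experimental evidence, the other based on motif prediction.

This website offers a rich set of DNA motif analysis tools: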

While one can use established lists of motifs to search one's DNA sequence, one can also discover them directly. To do this, one has to derive a consensus sequence or probability matrix. In the case of bacterial proteins for which the binding sites have been determined, good places to start are the   (A.M. McGuire, Harvard University, U.S.A.), and   : a database of transcriptional regulation in Bacillus subtilis (University of Tokyo, Japan). The following sites provide one with a training set which can be used to derive a Gibbs screening matrix.


An assessment of a set of motif identifiers can be found in .

 Gibbs Motif Sampler Homepage (E.C. Rouchka and B. Thompson, Bioinformatics Laboratory of Wadsworth Center, U.S.A.) - I have linked to the prokaryotic DNA default setting. On the   I have presented data for the IHF-binding site (consensus: WWWTCAA[N4]TTR).
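A consensus written with IUPAC codes, such as the IHF site above, can be turned into a regular expression and used to scan a sequence directly. The sketch below is only an illustration: it assumes that `[N4]` means "any four bases", the function names and the demo sequence are made up, and it searches the given strand only.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate a consensus such as 'WWWTCAA[N4]TTR' into a regex string.

    '[N4]' is read as 'any four bases' (an assumption about the notation).
    """
    parts = []
    # Each token is either a bracketed spacer like [N4] or one IUPAC letter.
    for spacer_len, letter in re.findall(r"\[N(\d+)\]|([A-Z])", consensus.upper()):
        if spacer_len:
            parts.append("[ACGT]{%s}" % spacer_len)
        else:
            bases = IUPAC[letter]
            parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(consensus, sequence):
    """Return (start, matched_text) for every match, overlapping ones included."""
    # The lookahead lets the scan report matches that overlap each other.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

IHF = "WWWTCAA[N4]TTR"
# Made-up sequence with one planted site starting at index 4.
demo = "GGCCAATTCAAGCGCTTGCCGG"
print(consensus_to_regex(IHF))   # [AT][AT][AT]TCAA[ACGT]{4}TT[AG]
print(find_sites(IHF, demo))     # [(4, 'AATTCAAGCGCTTG')]
```

To cover both strands, the same scan can be repeated on the reverse complement of the sequence.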

  - info-Gibbs (A. Neuwald & Jacques van Helden, Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, Belgium) - type in the matrix size desired and deselect "add reverse complement strand."  After running the program once I would delete those sequences from the discovery set which align imperfectly.

  (J. Zheng, Queen's University, Canada) - creates a matrix from a DNA Clustal alignment and also presents the consensus:

Number of sequences: 11
Length of alignment: 29
Consensus sequence representing: 80% matching base(s)

A  0  9 11 10  0  4  1  1  2  1  0  1  2  5  1  2  2  2  1  2  1  0  0 11  0 10  8  3  4
C  0  1  0  0  2  0  1  2  2  5  7  5  3  1  4  4  6  3  1  0 10 11  0  0  1  0  0  4  3
G  0  0  0  1  8  1  0  0  3  3  1  2  4  2  1  2  3  3  3  9  0  0 11  0  0  1  1  0  2
T 11  1  0  0  1  6  9  8  4  2  3  3  2  3  5  3  0  3  6  0  0  0  0  0 10  0  2  4  2

   T  A  A  A  S  W  T  Y  D  B  Y  B  V  D  Y  H  S  B  K  G  C  C  G  A  T  A  W  H  V
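The relationship between the count matrix and the 80% consensus can be sketched in a few lines. The rule used here is an inference, not the tool's documented method: for each column, bases are added from most to least frequent (ties broken in A, C, G, T order) until they cover at least 80% of the sequences, and the resulting set is written as an IUPAC code. Applied to the matrix above, this rule reproduces the consensus row shown.

```python
# IUPAC code for each set of bases (keys are sorted base strings).
IUPAC_CODE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def count_matrix(aligned):
    """Count A, C, G, T in each column of equal-length aligned sequences."""
    length = len(aligned[0])
    counts = {base: [0] * length for base in "ACGT"}
    for seq in aligned:
        for i, base in enumerate(seq.upper()):
            if base in counts:          # gaps and ambiguity codes are ignored
                counts[base][i] += 1
    return counts

def consensus(counts, threshold=0.8):
    """Per column, add bases from most to least frequent until the threshold is met."""
    out = []
    for column in zip(*(counts[b] for b in "ACGT")):
        total = sum(column)
        # sorted() is stable, so tied bases keep the A, C, G, T order.
        ranked = sorted(zip("ACGT", column), key=lambda pair: -pair[1])
        chosen, covered = [], 0
        for base, n in ranked:
            chosen.append(base)
            covered += n
            if covered >= threshold * total:
                break
        out.append(IUPAC_CODE["".join(sorted(chosen))])
    return "".join(out)

# The matrix from the text, copied row by row (11 sequences, 29 columns).
matrix = {
    "A": [0, 9, 11, 10, 0, 4, 1, 1, 2, 1, 0, 1, 2, 5, 1, 2, 2, 2, 1, 2, 1, 0, 0, 11, 0, 10, 8, 3, 4],
    "C": [0, 1, 0, 0, 2, 0, 1, 2, 2, 5, 7, 5, 3, 1, 4, 4, 6, 3, 1, 0, 10, 11, 0, 0, 1, 0, 0, 4, 3],
    "G": [0, 0, 0, 1, 8, 1, 0, 0, 3, 3, 1, 2, 4, 2, 1, 2, 3, 3, 3, 9, 0, 0, 11, 0, 0, 1, 1, 0, 2],
    "T": [11, 1, 0, 0, 1, 6, 9, 8, 4, 2, 3, 3, 2, 3, 5, 3, 0, 3, 6, 0, 0, 0, 0, 0, 10, 0, 2, 4, 2],
}
print(consensus(matrix))   # TAAASWTYDBYBVDYHSBKGCCGATAWHV
```

Starting from aligned sequences rather than counts, `consensus(count_matrix(aligned))` gives the same kind of result.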

  - the Sequence Similarities by Markov Chain Monte-Carlo algorithm finds DNA motifs of unknown length and complicated structure in a set of unaligned DNA sequences. It uses an improved motif length estimator and careful Bayesian analysis of the possibility that a site is absent from a sequence. Reference: A.V. Favorov et al. 2005. Bioinformatics 21: 2240-2245.

You may also want to consider the 

  (Softberry Inc.) - only two tools exist on the internet for mapping rho-independent terminators: FindTerm and TransTerm. You might consider using the advanced feature options and minimally increasing the default energy threshold to -12.0.

 Tools to find motif clusters in DNA sequences - one should probably start at   (Dr. Zhiping Weng, Boston University, U.S.A.), which has developed a wide range of tools to study the interaction between regulatory proteins and their DNA/RNA target sites, including:

 
 
 

 Find short split motifs in DNA sequences with  (Reference: Sinha, S. & Tompa, M. 2002. Nucl.Acids Res.)
  - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to "uncheck" the appropriate box if you don't want the complementary strand included in the analysis. (Reference: G. Thijs et al. 2002. J. Comput. Biol. 9: 447-464.)
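To give a feel for how Gibbs sampling arrives at a position probability matrix, here is a deliberately simplified site sampler. It is not the algorithm of any tool listed here: it assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background, the forward strand only, and input made of the letters A, C, G and T. All names and the toy sequences are illustrative.

```python
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Site-sampler Gibbs: one motif occurrence per sequence, forward strand only."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    def profile(exclude):
        """Position probability matrix from the current sites, skipping one sequence."""
        counts = [{b: 1.0 for b in "ACGT"} for _ in range(width)]   # pseudocount of 1
        for j, (s, st) in enumerate(zip(seqs, starts)):
            if j == exclude:
                continue
            for k in range(width):
                counts[k][s[st + k]] += 1
        return [{b: c[b] / sum(c.values()) for b in "ACGT"} for c in counts]

    for _ in range(iterations):
        for i, s in enumerate(seqs):
            ppm = profile(exclude=i)
            # Probability that the profile generates each window of the held-out sequence.
            weights = []
            for st in range(len(s) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= ppm[k][s[st + k]]
                weights.append(p)
            # Sample (rather than maximize) the new start in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, profile(exclude=None)

# Toy input: four made-up sequences that all contain TTGACA.
toy = ["ACGTTGACAGGT", "TTGACACCCAAA", "GGGTTGACATTT", "CATTGACAGGGA"]
starts, ppm = gibbs_motif_search(toy, width=6)
print(starts)
print([max(col, key=col.get) for col in ppm])   # most likely base per motif position
```

Because the procedure is stochastic, different seeds can settle on different sites; real tools add a background model, multiple restarts and motif-width estimation.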

  (A. Villegas, Public Health Ontario) - takes a GenBank flatfile (*.gbk) as input, parses it, and for every CDS that it finds extracts a pre-determined length of upstream DNA (the length is an argument and includes 3 nt for the initiation codon). The output is an FFN file of these upstream DNA sequences. N.B. this only works for prokaryotic sequences because it does not handle the splits or joins found in eukaryotic sequences. The data can then be analyzed with programs such as .
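The extraction step described here can be sketched without any GenBank parsing, assuming the CDS coordinates are already known. Everything below is hypothetical illustration rather than the tool's code: coordinates are 0-based and end-exclusive, each CDS is a single unspliced interval (the prokaryotic case), and regions are simply truncated at the ends of the sequence.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_region(genome, start, end, strand, length):
    """Return `length` nt ending with the 3-nt initiation codon of one CDS.

    start/end are 0-based, end-exclusive coordinates on the forward strand;
    strand is +1 or -1. A circular chromosome is not wrapped around.
    """
    if strand == 1:
        region_end = start + 3                       # include the start codon
        return genome[max(0, region_end - length):region_end]
    region_start = end - 3                           # start codon of a reverse-strand CDS
    return reverse_complement(genome[region_start:region_start + length])

def to_fasta(genome, cds_list, length):
    """cds_list holds (name, start, end, strand) tuples, one per CDS."""
    records = []
    for name, start, end, strand in cds_list:
        seq = upstream_region(genome, start, end, strand, length)
        records.append(">%s_upstream\n%s" % (name, seq))
    return "\n".join(records)

# Made-up 24-nt "genome" with one forward-strand CDS (ATGAAATAA) at 10..19.
genome = "CCCCCTTTTTATGAAATAAGGGGG"
print(to_fasta(genome, [("geneA", 10, 19, 1)], length=8))
```

In practice the `(name, start, end, strand)` tuples would come from the CDS features of the GenBank file.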


 II - Motif Elucidator in Nucleotide Sequence Assembly (Human Genome Center, University of Tokyo, Japan) - helps one extract a set of common motifs shared by functionally related DNA sequences. It utilizes CONSENSUS, GIBBS DNA, MEME and Coresearch, which are considered to be the most progressive motif search algorithms. Each algorithm is supplied with an impressive set of selection parameters.

  (Suite for Computational identification Of Promoter Elements), an ensemble of programs aimed at identifying novel cis-regulatory elements from groups of upstream sequences. (Reference: J.M. Carlson et al. 2007. Nucl. Acids Res. 35: W259-W264)

  (Predicted Prokaryotic Regulatory Proteins) - including transcription factors (TFs) and two-component systems (TCSs) based upon analysis of DNA or protein sequences. (Reference: Barakat M., 2013. BMC Genomics 14: 269) 

  - This server provides a suite of cis-regulatory motif analysis functions for DNA sequences. (Reference: Q.Ma et al.  2014. Nucleic Acids Res. 42(Web Server issue):W12-9.) 

  is an integrated web server for identifying functional RNA motifs in an input RNA sequence. These include splicing sites (donor site; acceptor site); splicing regulatory motifs (ESE; ESS; ISE; ISS elements); polyadenylation sites; transcriptional motifs (rho-independent terminator; TRANSFAC); translational motifs (ribosome binding sites); UTR motifs (UTRsite patterns); mRNA degradation elements (AU-rich elements); RNA editing sites (C-to-U editing sites); riboswitches (RiboSW); RNA cis-regulatory elements (Rfam; ERPIN); similar functional RNA sequences (fRNAdb); and RNA-RNA interaction regions (miRNA; ncRNA). (Reference: Chang TH et al. 2013. BMC Bioinformatics 14 Suppl 2:S4).

  - A Regulatory RNA Motifs and Elements Finder - RegRNA is an integrated web server for identifying the homologs of regulatory RNA motifs and elements against an input mRNA sequence. Both sequence homologs and structural homologs of regulatory RNA motifs can be recognized. The regulatory RNA motifs supported in RegRNA are categorized into several classes: (i) motifs in mRNA 5'-untranslated region (5'-UTR) and 3'-UTR; (ii) motifs involved in mRNA splicing; (iii) motifs involved in transcriptional regulation; (iv) riboswitches; (v) splicing donor/acceptor sites; (vi) inverted repeats; and (vii) miRNA target sites. (Reference: Huang HY et al. 2006. Nucleic Acids Res. 34(Web Server issue):W429-34).

  (Institute of Bioinformatics, University of Georgia, U.S.A.) - find under- and over-represented short oligonucleotides (di-, tri- and tetranucleotides) in a genome sequence.
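One common way to quantify under- and over-representation of dinucleotides is the odds ratio of the observed frequency to the frequency expected from the base composition. The sketch below shows that statistic for a single strand; it is not claimed to be the measure this particular tool uses, and the function name is made up.

```python
from collections import Counter

def dinucleotide_odds(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for every dinucleotide on one strand."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    odds = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            if expected > 0:             # skip pairs involving a base that never occurs
                odds[x + y] = (di[x + y] / n_di) / expected
    return odds

# Made-up example: AC is enriched, AA never occurs.
for pair, ratio in sorted(dinucleotide_odds("ACACACAC").items()):
    print(pair, round(ratio, 2))
```

Values well below 1 point to under-representation and values well above 1 to over-representation; the same idea extends to tri- and tetranucleotides with a suitable expectation model.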

  Ab Initio Motif Identification Environment - this tool should be useful for picking up high-copy dispersed repeats, such as repeated extragenic palindrome (REP) elements, CRISPR repeats, uptake signal sequences (DUS/USS), intergenic dyad sequences and several other over-represented sequence motifs in genome sequences. (Reference: Mrázek, J. et al. 2008. Bioinformatics 24: 1041-1048).

  (Institute of Bioinformatics, University of Georgia, U.S.A.) - find frequent words (oligonucleotides) in a genome sequence.
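Counting frequent words comes down to tallying every overlapping k-mer. A minimal sketch, with a made-up function name and example sequence, single strand only:

```python
from collections import Counter

def frequent_words(seq, k, top=5):
    """Return the `top` most common overlapping k-mers with their counts."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(top)

print(frequent_words("ATGATGATGC", k=3, top=1))   # [('ATG', 3)]
```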

  Analysis of sequence heterogeneity (Institute of Bioinformatics, University of Georgia, U.S.A.) - lets users generate sliding window plots of seven different sequence properties: G + C content; S3: G + C at codon site 3; d*: differences with respect to genomic average; synonymous codon bias with respect to genomic average; amino acid composition differences with respect to genomic average; (G - C) / (G + C): G-C skew; (A - T) / (A + T): A-T skew. It is intended for analysis of prokaryotic genomes, but it can be applied to eukaryotic chromosomes with some limitations.
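Three of these properties (G + C content, G-C skew and A-T skew) follow directly from the formulas given and can be computed per window as sketched below. The function name, window size and step are illustrative choices; windows with no G/C or no A/T are given a skew of 0 by convention here.

```python
def sliding_window_stats(seq, window, step):
    """Return (start, gc_content, gc_skew, at_skew) for each full window."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        a, c, g, t = (w.count(b) for b in "ACGT")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if g + c else 0.0   # (G - C) / (G + C)
        at_skew = (a - t) / (a + t) if a + t else 0.0   # (A - T) / (A + T)
        rows.append((start, gc_content, gc_skew, at_skew))
    return rows

# Made-up 8-nt sequence, two non-overlapping windows of 4 nt.
for row in sliding_window_stats("GGGCAATT", window=4, step=4):
    print(row)
# (0, 1.0, 0.5, 0.0)
# (4, 0.0, 0.0, 0.0)
```

For a real genome the window would typically be thousands of bases wide, and the rows can be handed to any plotting library.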

  (Pattern Locator) (Institute of Bioinformatics, University of Georgia, U.S.A.) - is a new tool for finding sequence patterns in long DNA sequences. For this web-based service, a restricted version of Pattern Locator is used, which estimates the time needed to complete the search and stops if the estimated CPU time exceeds a certain limit (currently 90 seconds). The CPU time limit was introduced to protect the web server from overloading due to requests involving overly complex sequence patterns. If you want to search for Sigma-70 (RpoD)-like promoters, the pattern syntax for your search is: <>{TTGACA(N)[15:18]TATAAT}[4]. N.B. the [4] allows for 4 mismatches - I recommend a maximum of two. If you only want one strand screened, omit the <> at the start. You can restrict the search to intergenic regions (but this will also eliminate matches that partially overlap with genes) or use the .patvic.txt output file to find where they are (Jan Mrázek, personal communication).
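What the Sigma-70 pattern asks for (a -35 box, a spacer of 15 to 18 bases, a -10 box, a cap on total mismatches, and optionally both strands) can be written out as a brute-force scan. This is a sketch of the search logic only, not of Pattern Locator itself; the function and parameter names are made up, and the default of two mismatches follows the recommendation above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def find_promoters(seq, box35="TTGACA", box10="TATAAT", spacer=(15, 18),
                   max_mismatches=2, both_strands=True):
    """Return (strand, start, spacer_length, mismatch_count) for each hit.

    For '-' hits, start is a position on the reverse-complemented sequence.
    """
    seq = seq.upper()
    strands = [("+", seq)]
    if both_strands:
        strands.append(("-", seq.translate(COMPLEMENT)[::-1]))
    hits = []
    for strand, s in strands:
        # Last start that still leaves room for the shortest full pattern.
        for start in range(len(s) - len(box35) - spacer[0] - len(box10) + 1):
            mm35 = mismatches(s[start:start + len(box35)], box35)
            if mm35 > max_mismatches:
                continue
            for gap in range(spacer[0], spacer[1] + 1):
                p10 = start + len(box35) + gap
                window = s[p10:p10 + len(box10)]
                if len(window) < len(box10):
                    break                          # ran off the end of the sequence
                total = mm35 + mismatches(window, box10)
                if total <= max_mismatches:
                    hits.append((strand, start, gap, total))
    return hits

# Made-up sequence: perfect boxes separated by a 17-nt spacer.
demo = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
print(find_promoters(demo, max_mismatches=0, both_strands=False))   # [('+', 2, 17, 0)]
```

Raising `max_mismatches` quickly multiplies the number of hits on a real genome, which is consistent with the advice to stay at two or fewer.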

 


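To read the other articles in the series 《生信小硕乱入生物实验室的幸福生活》 (The Happy Life of a Bioinformatics Master's Student Who Wandered into a Biology Lab), reply with 小哈 plus the article number, for example "小哈1".

小哈1. The opening chapter of junior labmate Ha's PhD journey

小哈2. How to check the relationship between lncRNAs and diseases in batch

小哈3. How to avoid unreliable results caused by batch effects

小哈4. What happens when the control is missing

小哈5. How to design a sequencing experiment for a familial genetic disease

小哈6. Determining dominant, recessive and sex-linked inheritance of genetic diseases

小哈7. Jane helps you choose journals and reviewers

小哈8. Using Gnosis to search for papers directly by impact factor

小哈9. What do histone modifications indicate?

小哈10. How long after drug treatment can changes in histone modifications be seen?

小哈11. How SNPs in lncRNAs affect their mechanism of action

小哈12. How much data should you sequence? Converting between read counts and G

小哈13. RT-qPCR validation: which region of which lncRNA should the primers target?

小哈14. Bioinformatics tools for studying a single gene (a comprehensive list)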
